In an effort to better understand meaning from natural language texts, we explore methods aimed
at organizing lexical objects into contexts. A number of these methods for organization fall into a
family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation
models are of special importance for their universal applicability. While we are interested here in
text and have framed our treatment appropriately, our work is potentially applicable to other areas
of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data
(e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether word or larger) as
the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously
developed framework for generating word-conserving phrase-frequency data. Upon training our
model with the Wiktionary (an extensive, online, collaborative, and open-source dictionary that
contains over 100,000 phrasal definitions) we develop highly effective filters for the identification of
meaningful, missing phrase entries. With our predictions we then engage the editorial community
of the Wiktionary and propose short lists of potential missing entries for definition, developing a
breakthrough lexical extraction technique, and expanding our knowledge of the defined English
lexicon of phrases.

PACS numbers: 89.65.-s,89.75.Fb,89.75.-k,89.70.-a 

I. BACKGROUND 

Starting with the work of Shannon [1], information
theory has grown enormously and has been shown by
Jaynes to have deep connections to statistical mechanics [2].
We focus on a particular aspect of Shannon’s
work, namely joint probability distributions between
word-types (denoted $w \in W$), and their groupings by
appearance-orderings, or, contexts (denoted $c \in C$). For
a word appearing in text, Shannon’s model assigned context
according to the word’s immediate antecedent. In
other words, the sequence

$$\cdots\; w_{i-1}\; w_i\; \cdots$$

places this occurrence of the word-type of $w_i$ in the context
of $w_{i-1}\,*$ (uniquely defined by the word-type of
$w_{i-1}$), where $*$ denotes “any word”. This experiment
was novel, and when these transition probabilities were
observed, he found a method for the automated production
of language that far better resembled true English
text than simple adherence to relative word frequencies.

Later, though still early on in the history of modern
computational linguistics and natural language processing,
theory caught up with Shannon’s work. In 1975,
Becker wrote [3]:

My guess is that phrase-adaption and generative
gap-filling are very roughly equally
important in language production, as measured
in processing time spent on each, or in
constituents arising from each. One way of
making such an intuitive estimate is simply
to listen to what people actually say when
they speak. An independent way of gauging
the importance of the phrasal lexicon is to
determine its size.

Since then, with the rise of computation and increasing
availability of electronic text, there have been numerous
extensions of Shannon’s context model. These models
have generally been information-theoretic applications as
well, mainly used to predict word associations [4] and to
extract multi-word expressions (MWEs) [5]. This latter
topic has been one of extreme importance for the computational
linguistics community [6], and has seen many
approaches aside from the information-theoretic, including
with part-of-speech taggers [7] (where categories, e.g.,
noun, verb, etc. are used to identify word combinations)
and with syntactic parsers [8] (where rules of grammar
are used to identify word combinations). However,
almost all of these methods have the common issue of
scalability [9], making them difficult to use for the extraction
of phrases of more than two words.

Information-theoretic extensions of Shannon’s context
model have also been used by Piantadosi et al. [10] to
extend the work of Zipf [11], using an entropic derivation
called the Information Content (IC):

$$I(w) = -\sum_{c \in C} P(c \mid w) \log P(w \mid c), \qquad (1)$$

and measuring its associations to word lengths. Though
there have been concerns over some of the conclusions
reached in this work [12-15], Shannon’s model was somewhat
generalized, and applied to 3-gram, 4-gram and 5-gram
context models to predict word lengths. This model
was also used by Garcia et al. [16] to assess the relationship
between sentiment (surveyed emotional response)
norms and IC measurements of words. However their
application of the formula

$$I(w) = -\frac{1}{f(w)} \sum_{i=1}^{f(w)} \log P(w \mid c_i), \qquad (2)$$

to N-gram data was wholly incorrect, as this special
representation applies only to corpus-level data, i.e.,
plot-line, human-readable text, and not the frequency-based
N-grams.

In addition to the above considerations, there is also
the general concern of non-physicality with imperfect
word frequency conservation, which is exacerbated by the
Piantadosi et al. extension of Shannon’s model. To be
precise, for a joint distribution of words and contexts
that is physically related to the appearance of words on
“the page”, there should be conservation in the marginal
frequencies:

$$f(w) = \sum_{c \in C} f(w, c), \qquad (3)$$

much like that discussed in [4]. This property is not
upheld using any true, sliding-window N-gram data
(e.g., [17-19]). To see this, we recall that in both of [16]
and [10], a word’s N-gram context was defined by its
immediate $N - 1$ antecedents. However, by this formulation
we note that the first word of a page appears as last
in no 2-gram, the second appears as last in no 3-gram,
and so on.

These word frequency misrepresentations may seem to
be of little importance at the text or page level, but since
the methods for large-scale N-gram parsing have adopted
the practice of stopping at sentence and clause boundaries [19],
word frequency misrepresentations (like those
discussed above) have become very significant. In the
new format, 40% of the words in a sentence or clause of
length five have no 3-gram context (the first two). As
such, when these context models are applied to modern
N-gram data, they are incapable of accurately representing
the frequencies of words expressed. We also note that
despite the advances in processing made in the construction
of the current Google N-grams corpus [19], other
issues have been found, namely regarding the source texts
taken [20].
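
The loss of word frequency at sentence boundaries can be checked directly. The following is a minimal sketch, not part of the original analysis: the function names and the five-word example sentence are hypothetical, and the code only counts how often each word ends a sliding-window N-gram that stays inside one sentence.

```python
from collections import Counter


def last_word_counts(tokens, n):
    """Count words by their appearances as the LAST word of a sliding-window
    n-gram taken inside one sentence (no window crosses the boundary)."""
    counts = Counter()
    for end in range(n - 1, len(tokens)):
        counts[tokens[end]] += 1
    return counts


def missing_fraction(sentence_length, n):
    """Fraction of word positions with no full (n - 1)-word antecedent."""
    uncovered = min(n - 1, sentence_length)
    return uncovered / sentence_length


# Hypothetical five-word sentence, used only for illustration.
sentence = "the cat in the hat".split()
true_counts = Counter(sentence)
context_counts = last_word_counts(sentence, 3)

# Occurrences that receive no 3-gram context (the first two positions).
lost = true_counts - context_counts
print(dict(lost))
print(missing_fraction(len(sentence), 3))
```

For a sentence of length five and 3-gram contexts, the uncovered fraction is 2/5, which matches the 40% quoted above.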

The N-gram expansion of Shannon’s model incorporated
more information on relative word placement, but
perhaps an ideal scenario would arise when the frequencies
of author-intended phrases are exactly known. Here,
one can conserve word frequencies (as we discuss in
Sec. II) when a context for an instance of a word is defined
by its removal pattern, i.e., the word “cat” appears in
the context “* in the hat”, when the phrase “cat in the
hat” is observed. In this way, a word-type appears in
as many contexts as there are phrase-types that contain
the word. While we consider the different phrase-types
as having rigid and different meanings, the words underneath
can be looked at as having more flexibility, often
in need of disambiguation. This flexibility is quite similar
to an aspect of a physical model of lexicon learning [21],
where a “context size” control parameter was
used to tune the number of plausible but unintended
meanings that accompany a single word’s true meaning.
An enhanced model of lexicon learning that focuses on
meanings of phrases could then explain the need for
disambiguation when reading by words.

We also note that there exist many other methods for
grouping occurrences of lexical units to produce informative
context models. As early as 1992 [22], Resnik
showed class categorizations of words (e.g., verbs and
nouns) could be used to produce informative joint probability
distributions. In 2010, Montemurro et al. [23] used
joint distributions of words and arbitrary equal-length
parts of texts to entropically quantify the semantic information
encoded in written language. Texts tagged with
metadata like genre [24], time [25], location [26], and
language [27] have rendered straightforward and clear
examples of the power in a (word-frequency conserving)
joint pmf at shedding light on social phenomena by relating
words to classes. Additionally, while their work did
not leverage word frequencies or the joint pmfs possible,
Benedetto et al. [28] used metadata of texts to train
language and authorship detection algorithms, and further,
construct accurate phylogenetic-like trees through
application of compression distances. Though metadata
approaches to context are informative, with their power
there is simultaneously a loss of applicability (metadata
is frequently not present), as well as a loss of
bio-communicative relevance (humans are capable of inferring
social information from text in isolation).


II. FREQUENCY-CONSERVING CONTEXT MODELS

In previous work [29] we developed a scalable and
general framework for generating frequency data for N-grams,
called random text partitioning. Since a phrase-frequency
distribution, S, is balanced with regard to its
underlying word-frequency distribution, W,

$$\sum_{w \in W} f(w) = \sum_{s \in S} \ell(s) f(s), \qquad (4)$$

(where $\ell$ denotes the phrase-length norm, which returns
the length of a phrase in numbers of words) it is
easy to produce a symmetric generalization of Shannon’s
model that integrates all phrase/N-gram lengths
and all word placement/removal points. To do so, we
define W and S to be the sets of words and (text-partitioned)
phrases from a text respectively, and let C
be the collection of all single word-removal patterns from
the phrases of S. A joint frequency, $f(w, c)$, is then
defined by the partition frequency of the phrase that
is formed when c and w are composed. In particular,
if w composed with c renders s, we then set $f(w, c) = f(s)$,
which produces a context model on the words whose
marginal frequencies preserve their original frequencies
from “the page.” We refer to this, or such a
model for phrases, as an ‘external context model,’ since
the relations are produced by structure external to the
semantic unit.

TABLE I: A table showing the expansion of context lists for
longer and longer phrases. We define the internal contexts
of phrases by the removal of individual sub-phrases. These
contexts are represented as phrases with words replaced by
*’s. Any phrases whose word-types match after analogous
sub-phrase removals share the matching context. Here, the
columns are labeled 1-4 by sub-phrase length.
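
The external context model for words can be sketched in a few lines. This is an illustrative sketch only: the phrase frequencies below are made up, the function names are hypothetical, and phrases are assumed to be space-delimited strings. It builds the joint frequencies from single word-removal patterns and then checks that the word marginals equal the frequencies the words would have on the page.

```python
from collections import Counter


def external_context_model(phrase_freq):
    """Build joint frequencies f(w, c) from phrase frequencies f(s).

    Each context c is a single word-removal pattern of a phrase s, and
    f(w, c) = f(s) whenever w composed with c renders s.
    """
    joint = Counter()
    for phrase, freq in phrase_freq.items():
        words = phrase.split()
        for k, word in enumerate(words):
            context = " ".join(words[:k] + ["*"] + words[k + 1:])
            joint[(word, context)] += freq
    return joint


def word_marginals(joint):
    """Sum f(w, c) over contexts to recover the word frequencies f(w)."""
    marginals = Counter()
    for (word, _context), freq in joint.items():
        marginals[word] += freq
    return marginals


# Hypothetical partition frequencies, for illustration only.
phrase_freq = {"cat in the hat": 2, "the cat": 1, "hat": 3}
joint = external_context_model(phrase_freq)
marginals = word_marginals(joint)

# Word frequencies implied by the phrases: each phrase contributes f(s) per word.
expected = Counter()
for phrase, freq in phrase_freq.items():
    for word in phrase.split():
        expected[word] += freq

print(marginals == expected)
```

In this toy input the word “cat” appears in the context “* in the hat” with frequency 2 and in “the *” with frequency 1, so its marginal frequency is 3.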

The external word-context generalization is useful in its
own right, but our interest actually lies in the development
of a context model for the phrases themselves. To
do so, we define the ‘internal contexts’ of a phrase by the
patterns generated through the removal of sub-phrases.
To be precise, for a phrase s, and a sub-phrase $s_{i \cdots j}$ ranging
over words i through j, we define the context

$$c_{i \cdots j} = w_1 \cdots w_{i-1} * \cdots * w_{j+1} \cdots w_{\ell(s)} \qquad (5)$$

to be the collection of same-length phrases whose analogous
word removal (i through j) renders the same pattern
(when word-types are considered). We present the contexts
of generalized phrases of lengths 1-4 in Tab. I, as
described above. Looking at the table, it becomes clear
that these contexts are actually a mathematical formalization
of the generative gap filling proposed in [3], which
was semi-formalized by the phrasal templates discussed
at length by Smadja et al. in [5]. Between our formulation
and that of Smadja, the main difference of definition
lies in our restriction to contiguous word sequence (i.e.,
sub-phrase) removals, as is necessitated by the mechanics
of the secondary partition process, which defines the
context lists.
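
The enumeration of internal contexts can be illustrated with a short sketch. The function name is hypothetical and phrases are assumed to be space-delimited strings; the code simply replaces each contiguous sub-phrase (words i through j, 1-indexed) with the same number of *'s.

```python
def internal_contexts(phrase):
    """Map each sub-phrase index pair (i, j), 1-indexed and inclusive,
    to the pattern obtained by replacing words i..j with '*'."""
    words = phrase.split()
    n = len(words)
    contexts = {}
    for i in range(n):
        for j in range(i, n):
            pattern = words[:i] + ["*"] * (j - i + 1) + words[j + 1:]
            contexts[(i + 1, j + 1)] = " ".join(pattern)
    return contexts


first = internal_contexts("in the contrary")
second = internal_contexts("on the contrary")

# Contexts shared by the two phrases (same pattern after analogous removal).
shared = set(first.values()) & set(second.values())
for pattern in sorted(shared):
    print(pattern)
```

A phrase of length n has n(n + 1)/2 internal contexts under this enumeration, so a three-word phrase has six; the two example phrases share exactly those contexts in which the first word has been removed.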


The weighting of the contexts for a phrase is accomplished
simultaneously with their definition through
a secondary partition process describing the inner-contextual
modes of interpretation for the phrase. The
process is as follows. In an effort to relate an observed
phrase to other known phrases, the observer selectively
ignores a sub-phrase of the original phrase. To retain generality,
we do this by considering the random partitions
of the original phrase, and then assume that a sub-phrase
is ignored from a partition with probability proportional
to its length, to preserve word (and hence phrase) frequencies.
The conditional probabilities of inner context
are then:

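$$
\begin{aligned}
P(c_{i \cdots j} \mid s) &= P(\text{ignore } s_{i \cdots j} \text{ given a partition of } s) \\
&= P(\text{ignore } s_{i \cdots j} \text{ given } s_{i \cdots j} \text{ is partitioned from } s) \times P(s_{i \cdots j} \text{ is partitioned from } s).
\end{aligned}
\qquad (6)
$$

Utilizing the partition probability and our assumption,
we note from our work in [29] that

$$\ell(s) = \sum_{1 \le i \le j \le \ell(s)} \ell(s_{i \cdots j}) \, P_q(s_{i \cdots j} \mid s), \qquad (7)$$

which ensures through defining

$$P(c_{i \cdots j} \mid s) = \frac{\ell(s_{i \cdots j})}{\ell(s)} \, P_q(s_{i \cdots j} \mid s), \qquad (8)$$

the production of a valid, phrase-frequency preserving
context model:

$$
\begin{aligned}
\sum_{c \in C} f(c, s) &= \sum_{1 \le i \le j \le \ell(s)} P(c_{i \cdots j} \mid s) \, f(s) \\
&= f(s) \sum_{1 \le i \le j \le \ell(s)} \frac{\ell(s_{i \cdots j})}{\ell(s)} \, P_q(s_{i \cdots j} \mid s) \\
&= f(s),
\end{aligned}
\qquad (9)
$$

which preserves the underlying frequency distribution of
phrases. Note here that beyond this point in the document
we will use the normalized form,

$$P(c, s) = \frac{f(c, s)}{\sum_{s \in S} \sum_{c \in C} f(c, s)}, \qquad (10)$$

for convenience in the derivation of expectations in the
next section.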
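
The normalization behind the context weights can be checked numerically. The sketch below rests on an assumption that is not stated in this text: that the random partition of [29] breaks each interior gap of a phrase independently with probability q, so that a sub-phrase is partitioned out when its interior boundaries are cut and none of its internal gaps are. The function names are hypothetical. Under that assumption the length-weighted partition probabilities sum to one for every phrase length.

```python
def partition_probability(i, j, length, q):
    """ASSUMED form of P_q(s_{i..j} | s): each interior gap of the phrase is
    cut independently with probability q. Words are 1-indexed, i <= j."""
    # Boundaries of the sub-phrase that are interior to s must be cut.
    boundary_cuts = (1 if i > 1 else 0) + (1 if j < length else 0)
    # The j - i gaps inside the sub-phrase must remain uncut.
    return q ** boundary_cuts * (1.0 - q) ** (j - i)


def context_weights(length, q):
    """Weights P(c_{i..j} | s): sub-phrase length over phrase length,
    times the (assumed) partition probability."""
    weights = {}
    for i in range(1, length + 1):
        for j in range(i, length + 1):
            sub_length = j - i + 1
            weights[(i, j)] = (sub_length / length) * partition_probability(
                i, j, length, q
            )
    return weights


# The weights of every phrase should sum to one, so that summing
# f(c, s) = P(c | s) f(s) over contexts returns f(s).
for length in range(1, 6):
    total = sum(context_weights(length, q=0.5).values())
    print(length, round(total, 12))
```

The sum equals one because every partition of a phrase covers all of its words, whatever the value of q; other partition schemes with this covering property would normalize in the same way.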


III. LIKELIHOOD OF DICTIONARY DEFINITION

In this section we exhibit the power of the internal context
model through a lexicographic application, deriving
a measure of meaning and definition for phrases with
empirical phrase-definition data taken from a collaborative
open-access dictionary [30] (see Sec. V for more
information on our data and the Wiktionary). With the
rankings that this measure derives, we will go on to propose
phrases for definition with the editorial community
of the Wiktionary in an ongoing live experiment,
discussed in Sec. IV.

To begin, we define the dictionary indicator, D, to be
a binary norm on phrases, taking value 1 when a phrase
appears in the dictionary (i.e., has definition) and taking
value 0 when a phrase is unreferenced. The dictionary
indicator tells us when a phrase has reference in the
dictionary, and in principle can be replaced with other
indicator norms, for other purposes. Moving forward, we
note an intuitive description of the distribution average:

$$D(S) = \sum_{t \in S} D(t) P(t) = P(\text{randomly drawing a defined phrase from } S),$$

and go on to derive an alternative expansion through
application of the context model:

$$
\begin{aligned}
D(S) &= \sum_{t \in S} D(t) P(t) \\
&= \sum_{t \in S} D(t) P(t) \sum_{c \in C} P(c \mid t) \sum_{s \in S} P(s \mid c) \\
&= \sum_{c \in C} P(c) \sum_{t \in S} D(t) P(t \mid c) \sum_{s \in S} P(s \mid c) \\
&= \sum_{c \in C} P(c) \sum_{s \in S} P(s \mid c) \sum_{t \in S} D(t) P(t \mid c) \\
&= \sum_{s \in S} P(s) \sum_{c \in C} P(c \mid s) \sum_{t \in S} D(t) P(t \mid c) \\
&= \sum_{s \in S} P(s) \sum_{c \in C} P(c \mid s) D(c \mid S).
\end{aligned}
\qquad (11)
$$

In the last line we then interpret:

$$D(C \mid s) = \sum_{c \in C} P(c \mid s) D(c \mid S), \qquad (12)$$

to be the likelihood (analogous to the IC equation presented
here as equation 1) that a phrase, which is randomly
drawn from a context of s, has definition in
the dictionary. To be precise, we say $D(C \mid s)$ is the
likelihood of dictionary definition of the context model
C, given the phrase s, or, when only one $c \in C$ is considered,
we say $D(c \mid S) = \sum_{t \in S} D(t) P(t \mid c)$ is the likelihood
of dictionary definition of the context c, given S.
Numerically, we note that the distribution-level values,
$D(C \mid s)$, “extend” the dictionary over all S, smoothing
out the binary data to the full lexicon (uniquely for
phrases of more than one word, which have no interesting
space-defined internal structure) through the relations of
the model. In other words, though $D(C \mid s) > 0$ may now
only indicate the possibility of a phrase having definition,
it is still a strong indicator, and most importantly, may
be applied to never-before-seen expressions. We illustrate
the extension of the dictionary through $D$ in Fig. 1,


in the contrary (D = 0)        on the contrary (D = 1)

FIG. 1: An example showing the sharing of contexts by similar
phrases. Suppose our text consists of the two phrases, “in
the contrary” and “on the contrary”, and that each occurs
once, and that the latter has definition (D = 1) while the
former does not. In this event, we see that the three shared
contexts: “* * *”, “* * contrary”, and “* the contrary”,
present elevated likelihood (D) values, indicating that the
phrase “in the contrary” may have meaning and be worthy of
definition.


where it becomes clear that the topological structure of
the associated network of contexts is crystalline, unlike
the small-world phenomenon observed for the words of a
thesaurus in [31]. However, this is not surprising, given
that the latter is a conceptual network defined by common
meanings, as opposed to a rigid, physical property,
such as word order.
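
The two-phrase example of Fig. 1 can be worked through numerically. The following is a sketch under stated assumptions rather than the implementation used for the results: the context weights use an assumed partition scheme (each interior gap cut independently with probability q, here q = 0.5), and all function names are hypothetical.

```python
from collections import defaultdict


def weighted_contexts(phrase, q):
    """Internal contexts of a phrase with weights P(c | s).

    ASSUMPTION: each interior gap is cut independently with probability q,
    and a sub-phrase is ignored with probability proportional to its length.
    """
    words = phrase.split()
    n = len(words)
    out = {}
    for i in range(n):
        for j in range(i, n):
            pattern = " ".join(words[:i] + ["*"] * (j - i + 1) + words[j + 1:])
            cuts = (1 if i > 0 else 0) + (1 if j < n - 1 else 0)
            out[pattern] = (j - i + 1) / n * q ** cuts * (1 - q) ** (j - i)
    return out


def definition_likelihood(phrase_freq, defined, q=0.5):
    """Compute D(C | s) = sum_c P(c | s) D(c | S) for every phrase s."""
    weights = {s: weighted_contexts(s, q) for s in phrase_freq}
    context_mass = defaultdict(float)   # sum over t of f(c, t)
    defined_mass = defaultdict(float)   # sum over t of D(t) f(c, t)
    for s, freq in phrase_freq.items():
        for c, w in weights[s].items():
            context_mass[c] += w * freq
            if s in defined:
                defined_mass[c] += w * freq
    scores = {}
    for s in phrase_freq:
        scores[s] = sum(
            w * defined_mass[c] / context_mass[c]
            for c, w in weights[s].items()
            if context_mass[c] > 0
        )
    return scores


# The text of Fig. 1: two phrases, each occurring once, one of them defined.
phrase_freq = {"in the contrary": 1, "on the contrary": 1}
defined = {"on the contrary"}
scores = definition_likelihood(phrase_freq, defined, q=0.5)
for phrase, score in scores.items():
    print(phrase, round(score, 4))
```

With these assumed weights the undefined phrase “in the contrary” receives a nonzero likelihood through its three shared contexts, and the frequency-weighted average of the scores returns the fraction of defined phrases (here 1/2), as the expansion of D(S) requires.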


IV. PREDICTING MISSING DICTIONARY ENTRIES

Starting with the work of Sinclair [32] (though the
idea was proposed more than 10 years earlier by Becker
in [3]), lexicographers have been building dictionaries
based on language as it is spoken and written, including
idiomatic, slang-filled, and grammatical expressions
[33-36]. These dictionaries have proven highly effective
for non-primary language learners, who may not be privy
to cultural metaphors. In this spirit, we utilize the context
model derived above to discover phrases that are
undefined, but which may be in need of definition for
their similarity to other, defined phrases. We do this
in a corpus-based way, using the definition likelihood
$D(C \mid s)$ as a secondary filter to frequency. The process
is in general quite straightforward, and first requires
a ranking of phrases by frequency of occurrence, $f(s)$.
Upon taking the first $s_1, \ldots, s_N$ frequency-ranked phrases
(N = 100,000, for our experiments), we reorder the list
according to the values $D(C \mid s)$ (descending). The top
of such a double-sorted list then includes phrases that
are both frequent and similar to defined phrases.

With our double-sorted lists we then record those 
phrases having no definition or dictionary reference, but 
which are at the top. These phrases are quite often 
meaningful (as we have found experimentally, see below) 
despite their lack of definition, and as such we propose 
this method for the automated generation of short lists 
for editorial investigation of definition. 
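
The double-sort described above can be sketched as follows. The function name, its parameters, and the toy inputs are hypothetical; the likelihood values are assumed to have been computed beforehand and are not real results.

```python
def short_list(phrase_freq, likelihood, defined, n_top=100_000, size=20):
    """Return the top undefined phrases after a frequency cut and a
    likelihood re-sort (both descending)."""
    # First filter: keep the n_top most frequent phrases.
    by_frequency = sorted(phrase_freq, key=phrase_freq.get, reverse=True)[:n_top]
    # Second filter: reorder by definition likelihood. Python's sort is
    # stable, so ties in likelihood stay in frequency order.
    by_likelihood = sorted(
        by_frequency, key=lambda s: likelihood.get(s, 0.0), reverse=True
    )
    # Record only phrases that have no dictionary reference.
    return [s for s in by_likelihood if s not in defined][:size]


# Toy inputs for illustration only.
phrase_freq = {"in the": 100, "on the contrary": 5, "in the contrary": 3, "rare phrase": 1}
likelihood = {"in the": 0.01, "on the contrary": 0.7, "in the contrary": 0.3, "rare phrase": 0.9}
defined = {"on the contrary"}

print(short_list(phrase_freq, likelihood, defined, n_top=3, size=2))
```

In this toy case the phrase with the highest likelihood is dropped by the frequency cut, and the already-defined phrase is skipped, so the short list leads with “in the contrary”.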


V. MATERIALS AND METHODS 

For its breadth, open-source nature, and large editorial
community, we utilize dictionary data from the Wiktionary [30]
(a Wiki-based open content dictionary) to
build the dictionary-indicator norm, setting $D(s) = 1$ if
a phrase s has reference or redirect.

We apply our filter for missing entry detection to several
large corpora from a wide scope of content. These
corpora are: twenty years of New York Times articles
(NYT, 1987-2007) [37], approximately 4% of a year’s
tweets (Twitter, 2009) [38], music lyrics from thousands
of songs and authors (Lyrics, 1960-2007) [24], complete
Wikipedia articles (Wikipedia, 2010) [39], and the Project
Gutenberg eBooks collection (eBooks, 2012) [40] of more
than 30,000 public-domain texts. We note that these are
all unsorted texts, and that Twitter, eBooks, Lyrics, and
to an extent, Wikipedia are mixtures of many languages
(though majority English). We only attempt missing
entry prediction for phrase lengths 2-5, for their inclusion
in other major collocation corpora [19], as well as
their having the most data in the dictionary. We also
note that all text processed is taken lower-case.

To understand our results, we perform a 10-fold cross-validation
on the frequency and likelihood filters. This
is executed by randomly splitting the Wiktionary’s list of
defined phrases into 10 equal-length pieces, and then performing
10 parallel experiments. In each of these experiments
we determine the likelihood values, $D(C \mid s)$, by
a distinct 9/10 of the data. We then order the union set
of the 1/10-withheld and the Wiktionary-undefined phrases
by their likelihood (and frequency) values descending,
and accept some top segment of the list, or, ‘short list’,
coding them as positive by the experiment. For such
a short list, we then record the true positive rates, i.e.,
portion of all 1/10-withheld truly-defined phrases we coded
positive, the false positive rates, i.e., portion of all truly-undefined
phrases we coded positive, and the number of
entries discovered. Upon performing these experiments,
the average of the ten trials is taken for each of the three
parameters, for a number of short list lengths (scanning
1,000 log-spaced lengths), and plotted as a receiver operating
characteristic (ROC) curve (see Figs. 2-6). We also
note that each is also presented with its area under curve
(AUC), which measures the accuracy of the expanding-list
classifier as a whole.
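
The bookkeeping of one cross-validation trial can be sketched as below. This is a simplified illustration with hypothetical function names and toy scores: it scans every short list length rather than 1,000 log-spaced lengths, it does not recompute likelihoods per fold, and the folds differ in size by at most one when the list does not divide evenly.

```python
import random


def ten_fold_splits(defined_phrases, k=10, seed=0):
    """Randomly split the defined phrases into k near-equal pieces."""
    items = sorted(defined_phrases)
    random.Random(seed).shuffle(items)
    return [set(items[i::k]) for i in range(k)]


def roc_points(scores, withheld, undefined):
    """True/false positive rates for every short list length.

    Phrases are ranked by score (descending); the first m are coded positive.
    """
    ranked = sorted(
        sorted(withheld | undefined),
        key=lambda s: scores.get(s, 0.0),
        reverse=True,
    )
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for phrase in ranked:
        if phrase in withheld:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / len(undefined), true_pos / len(withheld)))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy scores for illustration only: withheld (truly defined) phrases rank first.
withheld = {"a b", "c d"}
undefined = {"e f", "g h"}
scores = {"a b": 0.9, "c d": 0.8, "e f": 0.2, "g h": 0.1}
points = roc_points(scores, withheld, undefined)
print(area_under_curve(points))
```

An AUC of 1 corresponds to a ranking that places every withheld defined phrase ahead of every undefined phrase, as in this toy input.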



(Above) Cross-validation: likelihood filter (frequency filter)

Corpus       2-gram        3-gram        4-gram        5-gram
Twitter      4.22 (0.40)   1.11 (0.30)   0.90 (0.10)   1.49 (0)
NYT          4.97 (0.30)   0.36 (0.50)   0.59 (0.10)   1.60 (0)
Lyrics       3.52 (0.50)   1.76 (0.40)   0.78 (0)      0.48 (0)
Wikipedia    5.06 (0.20)   0.46 (0.80)   1.94 (0.20)   1.54 (0)
eBooks       3.64 (0.30)   1.86 (0.30)   0.59 (0.60)   0.90 (0.10)

(Below) Live experiment: likelihood filter (frequency filter)

Corpus       2-gram        3-gram        4-gram        5-gram
Twitter      6 (0)         4 (0)         5 (0)         5 (0)
NYT          5 (0)         0 (0)         2 (0)         1 (0)
Lyrics       3 (0)         1 (0)         3 (0)         1 (0)
Wikipedia    0 (0)         1 (0)         1 (0)         2 (0)
eBooks       2 (0)         1 (0)         3 (0)         6 (1)

TABLE II: Summarizing our results from the cross-validation
procedure (Above), we present the mean numbers of missing
entries discovered when 20 guesses were made for N-grams/phrases
of lengths 2, 3, 4, and 5, each. For each of
the 5 large corpora (see Materials and Methods) we make
predictions according to our likelihood filter, and according to
frequency (in parentheses) as a baseline. When considering
the 2-grams (for which the most definition information
exists), short lists of 20 rendered up to 25% correct predictions
on average by the definition likelihood, as opposed to
the frequency ranking, by which no more than 2.5% could be
expected. We also summarize the results to-date from the
live experiment (Below) (updated February 19, 2015), and
present the numbers of missing entries correctly discovered
on the Wiktionary (i.e., reference added since July 1, 2014,
when the dictionary’s data was accessed) by the 20-phrase
shortlists produced in our experiments for both the likelihood
and frequency (in parentheses) filters. Here we see that all of
the corpora analyzed were generative of phrases, with Twitter
far and away being the most productive, and the reference
corpus Wikipedia the least so.


VI. RESULTS AND DISCUSSION 

Before observing output from our model we take the
time to perform a cross-validation (10-fold), and compare
our context filter to a sort by frequency alone.
From this we have found that our likelihood filter renders
missing entries much more efficiently than by frequency
(see Tab. II, and Figs. 2-6), already discovering missing
entries from short lists of as few as twenty (see the
insets of Figs. 2-6 as well as Tabs. II, III, and IV-VII). As
such we adhere to this standard, and only publish short
lists of 20 predictions per corpus for each of the phrase
lengths 2-5. In parallel, we also present phrase frequency-generated
short lists for comparison.

In addition to listing them in the appendices, we have
presented the results of our experiment from across the 5
large, disparate corpora on the Wiktionary in a pilot program,
where we are tracking the success of the filters [41].
Looking at the lexical tables, where defined phrases are
highlighted in red, we can see that many of the predictions
by the likelihood filter (especially those obtained
(Above) Definition likelihood filter

rank  2-gram         3-gram                          4-gram                     5-gram
1     buenos noches  knock it out                    in the same time           actions speak louder then words
2     north york     walk of fame                    on the same boat           no sleep for the wicked
3     last few       piece of mind                   about the same time        every once and a while
4     holy hell      seo-search engine optimization  around the same time       to the middle of nowhere
5     good am        puta q pariu                    at da same time            come to think about it
6     going away     who the heck                    wat are you doing          dont let the bedbugs bite
7     right up       take it out                     wtf are you doing          you get what i mean
8     go sox         fim de mundo                    why are you doing          you see what i mean
9     going well     note to all                     hell are you doing         you know who i mean
10    due out        in the moment                   better late then never     no rest for the weary
11    last bit       note to myself                  here i go again            as long as i know
12    go far         check it here                   every now and again        as soon as i know
13    right out      check it at                     what were you doing        going out on a limb
14    fuck am        check it http                   was it just me             give a person a fish
15    holy god       check it now                    here we are again          at a lost for words
16    rainy morning  check it outhttp                keeping an eye out         de una vez por todas
17    picked out     why the heck                    what in the butt           onew kids on the block
18    south coast    memo to self                    de vez em qdo              twice in a blue moon
19    every few      reminder to self                giving it a try            just what the dr ordered
20    picking out    how the heck                    pain in my ass             as far as we know

(Below) Frequency filter

rank  2-gram         3-gram                          4-gram                     5-gram
1     in the         new blog post                   i just took the            i favorited a youtube video
2     i just         i just took                     e meu resultado foi        i uploaded a youtube video
3     of the         live on http                    other people at http       just joined a video chat
4     on the         i want to                       check this video out       fiddling with my blog post
5     i love         i need to                       just joined a video        joined a video chat with
6     i have         i have a                        a day using http           i rated a youtube video
7     i think        quiz and got                    on my way to               i just voted for http
8     to be          thanks for the                  favorited a youtube video  this site just gave me
9     i was          what about you                  i favorited a youtube      add a #twibbon to your
10    if you         i think i                       free online adult dating   the best way to get
11    at the         i have to                       a video chat with          just changed my twitter background
12    have a         looking forward to              uploaded a youtube video   a video chat at http
13    to get         acabo de completar              i uploaded a youtube       photos on facebook in the
14    this is        i love it                       video chat at http         check it out at http
15    and i          a youtube video                 what do you think          own video chat at http
16    but i          to go to                        i am going to              s channel on youtube http
17    are you        of the day                      if you want to             and won in :#:mobsterworld http
18    it is          what’ll you get                 i wish i could             live stickam stream at http
19    i need         my daily twittascope            just got back from         on facebook in the album
20    it was         if you want                     thanks for the rt          added myself to the http

TABLE III: With data taken from the Twitter corpus, we present the top 20 unreferenced phrases considered for definition (in
the live experiment) from each of the 2, 3, 4, and 5-gram likelihood filters (Above), and frequency filters (Below). From this
corpus we note the juxtaposition of highly idiomatic expressions by the likelihood filter (like “holy hell”), with the domination
of the frequency filters by semi-automated content. The phrase “holy hell” is an example of the model’s success with this
corpus, as it achieved definition (February 8th, 2015) concurrently with the preparation of this manuscript (several months
after the Wiktionary’s data was accessed in July, 2014).


from the Twitter corpus) have already been defined in 
the Wiktionary following our recommendation (as of Feb. 
19th 2015) since we accessed its data in July of 2014 [30]. 
We also summarize these results from the live experiment 
in Tab. II. 


Looking at the lexical tables more closely, we note that
all corpora present highly idiomatic expressions under
the likelihood filter, many of which are variants of existing
idiomatic phrases that will likely be granted inclusion
into the dictionary through redirects or alternative-forms
listings. To name a few, the Twitter (Tab. III),
Times (Tab. IV), and Lyrics (Tab. V) corpora consistently
predict large families derived from phrases like “at
the same time”, and “you know what i mean”, while the
eBooks and Wikipedia corpora predict families derived
from phrases like “on the other hand”, and “at the same
time”. In general we see no such structure or predictive
power emerge from the frequency filter.

FIG. 2: With data taken from the Twitter corpus, we present
(10-fold) cross-validation results for the filtration procedures.
For each of the lengths 2, 3, 4, and 5, we show the ROC curves
(Main Axes), comparing true and false positive rates for
both the likelihood filters (black), and for the frequency filters
(gray). There, we see increased performance in the likelihood
classifiers (except possibly for length 5), which is reflected in
the AUCs (where an AUC of 1 indicates a perfect classifier).
We also monitor the average number of missing entries
discovered as a function of the number of entries proposed
(Insets), for each length. There, the horizontal dotted lines
indicate the average numbers of missing entries discovered for
both the likelihood filters (black) and for the frequency filters
(gray) when short lists of 20 phrases were taken (red dotted
vertical lines). From this we see an indication that even the
5-gram likelihood filter is effective at detecting missing entries
in short lists, while the frequency filter is not.

We also observe that from those corpora which are less
pure of English context (namely, the eBooks, Lyrics, and
Twitter corpora), extra-English expressions have crept
in. This highlights an important feature of the likelihood
filter: it does not intrinsically rely on the syntax
or grammar of the language to which it is applied, beyond
the extent to which syntax and grammar affect the shapes
of collocations. For example, the eBooks predict (see
Tab. VII) the undefined French phrase “tu ne sais pas”,
or “you do not know”, which is a syntactic variant of
the English-Wiktionary defined French, “je ne sais pas”,
meaning “i do not know”. Seeing this, we note that it
would be straightforward to construct a likelihood filter
with a language indicator norm to create an alternative
framework for language identification.

There are also a fair number of phrases predicted by 
the likelihood filter that are in fact spelling errors, 
typos, and grammatical errors. In terms of the context 
model, these erroneous forms are quite near to 
those defined in the dictionary, and so rise in the short 
lists generated from the less-well-edited corpora, e.g., 
“actions speak louder then words” in the Twitter corpus. 
This seems to indicate the potential for the likelihood 
filter to be integrated into auto-correct algorithms, 
and further points to the possibility of constructing syntactic 
indicator norms of phrases, making estimations of 
tenses and parts of speech (whose data are also available 
from the Wiktionary [30]) possible through application 
of the model in precisely the same manner presented in 
Sec. III. Regardless of the future applications, we have 
developed and presented a novel, powerful, and scalable 
MWE extraction technique. 
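
To illustrate why near-variants such as “tu ne sais pas” and “actions speak louder then words” sit close to defined entries, the following minimal Python sketch assumes a simplified notion of context in which a single word of a phrase is replaced by a wildcard. The function names (wildcard_contexts, shared_context_neighbors) and the three-entry toy dictionary are hypothetical, and the sketch omits the frequency weighting and likelihood computation of Sec. III; it only exposes the shared-context structure. 

```python
def wildcard_contexts(phrase):
    """Return the contexts obtained by replacing one word with '*'.

    Assumption: a simplified, single-word-wildcard notion of context.
    """
    words = phrase.split()
    return {
        " ".join(words[:i] + ["*"] + words[i + 1:])
        for i in range(len(words))
    }


def shared_context_neighbors(candidate, defined_phrases):
    """Map each defined phrase to the contexts it shares with the candidate."""
    candidate_contexts = wildcard_contexts(candidate)
    neighbors = {}
    for entry in defined_phrases:
        if entry == candidate:
            continue  # a phrase is not its own neighbor
        shared = candidate_contexts & wildcard_contexts(entry)
        if shared:
            neighbors[entry] = sorted(shared)
    return neighbors


# Hypothetical toy dictionary of defined phrases (not Wiktionary data).
defined = [
    "je ne sais pas",
    "actions speak louder than words",
    "at the same time",
]

print(shared_context_neighbors("tu ne sais pas", defined))
print(shared_context_neighbors("actions speak louder then words", defined))
```

In this toy setting each undefined variant shares exactly one context with a defined entry (“* ne sais pas” and “actions speak louder * words”, respectively), which is the kind of overlap that would let such forms rise in a context-based short list. 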

Appendix A: Cross-validation results for missing entry detection 


1. The New York Times 




FIG. 3: With data taken from the NYT corpus, we present (10-fold) cross-validation results for the filtration procedures. For 
each of the lengths 2, 3, 4, and 5, we show the ROC curves (Main Axes), comparing true and false positive rates for both the 
likelihood filters (black), and for the frequency filters (gray). There, we see increased performance in the likelihood classifiers 
(except possibly for length 5), which is reflected in the AUCs (where an AUC of 1 indicates a perfect classifier). We also 
monitor the average number of missing entries discovered as a function of the number of entries proposed (Insets), for each 
length. There, the horizontal dotted lines indicate the average numbers of missing entries discovered for both the likelihood 
filters (black) and for the frequency filters (gray) when short lists of 20 phrases were taken (red dotted vertical lines). From 
this we see an indication that even the 5-gram likelihood filter is effective at detecting missing entries in short lists, while the 
frequency filter is not. 
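
The quantities reported in these figures can be computed from any ranked list of candidate phrases. The following Python sketch is illustrative only: it assumes that each candidate carries a filter score and a binary label (1 if the phrase is a held-out dictionary entry, 0 otherwise) and that scores are distinct. The function names (roc_points, auc, hits_in_short_list) and the toy data are hypothetical and are not drawn from the corpora above. 

```python
def roc_points(scores, labels):
    """Sweep a threshold down the ranking; return (FPR, TPR) points.

    Assumes distinct scores and at least one positive and one negative label.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def hits_in_short_list(scores, labels, k=20):
    """Number of held-out entries recovered among the top-k candidates."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    return sum(label for _, label in ranked[:k])


# Toy example: six candidates, three of which are held-out entries.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]

print(auc(roc_points(toy_scores, toy_labels)))
print(hits_in_short_list(toy_scores, toy_labels, k=3))
```

In a 10-fold setting, one would repeat this over each held-out tenth of the dictionary and average the resulting curves and short-list counts. 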

2. Music Lyrics 



FIG. 4: With data taken from the Lyrics corpus, we present (10-fold) cross-validation results for the filtration procedures. 
For each of the lengths 2, 3, 4, and 5, we show the ROC curves (Main Axes), comparing true and false positive rates for 
both the likelihood filters (black), and for the frequency filters (gray). There, we see increased performance in the likelihood 
classifiers, which is reflected in the AUCs (where an AUC of 1 indicates a perfect classifier). We also monitor the average 
number of missing entries discovered as a function of the number of entries proposed (Insets), for each length. There, the 
horizontal dotted lines indicate the average numbers of missing entries discovered for both the likelihood filters (black) and for 
the frequency filters (gray) when short lists of 20 phrases were taken (red dotted vertical lines). Here we can see that it may 
have been advantageous to construct slightly longer 3- and 4-gram lists. 

3. English Wikipedia 



FIG. 5: With data taken from the Wikipedia corpus, we present (10-fold) cross-validation results for the filtration procedures. 
For each of the lengths 2, 3, 4, and 5, we show the ROC curves (Main Axes), comparing true and false positive rates for 
both the likelihood filters (black), and for the frequency filters (gray). There, we see increased performance in the likelihood 
classifiers, which is reflected in the AUCs (where an AUC of 1 indicates a perfect classifier). We also monitor the average 
number of missing entries discovered as a function of the number of entries proposed (Insets), for each length. There, the 
horizontal dotted lines indicate the average numbers of missing entries discovered for both the likelihood filters (black) and for 
the frequency filters (gray) when short lists of 20 phrases were taken (red dotted vertical lines). Here we can see that it may 
have been advantageous to construct slightly longer 3- and 4-gram lists. 

4. Project Gutenberg eBooks 



FIG. 6: With data taken from the eBooks corpus, we present (10-fold) cross-validation results for the filtration procedures. 
For each of the lengths 2, 3, 4, and 5, we show the ROC curves (Main Axes), comparing true and false positive rates for 
both the likelihood filters (black), and for the frequency filters (gray). There, we see increased performance in the likelihood 
classifiers, which is reflected in the AUCs (where an AUC of 1 indicates a perfect classifier). We also monitor the average 
number of missing entries discovered as a function of the number of entries proposed (Insets), for each length. There, the 
horizontal dotted lines indicate the average numbers of missing entries discovered for both the likelihood filters (black) and for 
the frequency filters (gray) when short lists of 20 phrases were taken (red dotted vertical lines). Here we can see that the power 
of the 4-gram model does not show itself until longer lists are considered. 

Appendix B: Tables of potential missing entries 

1. The New York Times 



Likelihood filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | prime example | as united states | in the same time | when push came to shove 
2 | going well | in united states | about the same time | nat. ocean, and atm. admin. 
3 | south jersey | by united states | around the same time | all’s well that ends well’ 
4 | north jersey | eastern united states | during the same time | you see what i mean 
5 | united front | first united states | roughly the same time | so far as i know 
6 | go well | a united states | return to a boil | take it or leave it’ 
7 | gulf states | to united states | every now and again | gone so far as to 
8 | united germany | for united states | at the very time | love it or leave it 
9 | dining out | senior united states | nowhere to be seen | as far as we’re concerned 
10 | north brunswick | of united states | for the long run | as bad as it gets 
11 | go far | from united states | over the long run | as far as he’s concerned 
12 | going away | is a result | why are you doing | days of wine and roses’ 
13 | there all | and united states | in the last minute | as far as we know 
14 | picked out | with united states | to the last minute | state of the county address 
15 | go all | that united states | until the last minute | state of the state address 
16 | this same | two united states | remains to be done | state of the city address 
17 | civil court | its united states | turn of the screw | just a matter of time 
18 | good example | assistant united states | turn of the last | be a matter of time 
19 | this instance | but united states | turn of the millennium | for the grace of god 
20 | how am | western united states | once upon a mattress | short end of the market 

Frequency filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | of the | one of the | in the united states | at the end of the 
2 | in the | in new york | for the first time | because of an editing error 
3 | he said | the new york | the new york times | the new york stock exchange 
4 | and the | some of the | in new york city | for the first time in 
5 | for the | part of the | at the end of | he is survived by his 
6 | at the | of new york | the end of the | is survived by his wife 
7 | in a | president of the | a spokesman for the | an initial public offering of 
8 | to be | the end of | at the university of | by the end of the 
9 | with the | there is a | one of the most | the end of the year 
10 | that the | director of the | of the united states | the securities and exchange commission 
11 | it is | it was a | a member of the | for the first time since 
12 | from the | according to the | the rest of the | for students and the elderly 
13 | she said | in the last | at the age of | beloved wife of the late 
14 | by the | the white house | to the united states | he said in an interview 
15 | it was | in the united | in lieu of flowers | the dow jones industrial average 
16 | as a | the university of | executive director of the | the executive director of the 
17 | he was | there is no | the united states and | tonight and tomorrow night at 
18 | is a | it is a | is one of the | in the last two years 
19 | with a | the first time | of the new york | in the new york times 
20 | and a | in the first | by the end of | in the last few years 

TABLE IV: With data taken from the NYT corpus, we present the top 20 unreferenced phrases considered for definition (in 
the live experiment) from each of the 2, 3, 4, and 5-gram likelihood filters (Above), and frequency filters (Below). From 
this corpus we note the juxtaposition of highly idiomatic expressions by the likelihood filter (like “united front”), with the 
domination of the frequency filters by structural elements of rigid content (e.g., the obituaries). The phrase “united front” is 
an example of the model’s success with this corpus, as its coverage in a Wikipedia article began in 2006, describing the general 
Marxist tactic extensively. We also note that we have abbreviated “national oceanographic and atmospheric administration” 
(Above), for brevity. 

2. Music Lyrics 



Likelihood filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | uh ha | now or later | one of a million | when push come to shove 
2 | come aboard | change of mind | made up your mind | come hell of high water 
3 | strung up | over and done | every now and again | you see what i mean 
4 | fuck am | forth and forth | make up my mind | you know that i mean 
5 | iced up | in and down | son of the gun | until death do us part 
6 | merry little | now and ever | cry me a river-er | that’s a matter of fact 
7 | get much | off the air | have a good day | it’s a matter of fact 
8 | da same | on and go | on way or another | what goes around comes back 
9 | messed around | check it check | for the long run | you reap what you sew 
10 | old same | stay the fuck | feet on solid ground | to the middle of nowhere 
11 | used it | set the mood | feet on the floor | actions speak louder than lies 
12 | uh yeah | night to day | between you and i | u know what i mean 
13 | uh on | day and every | what in the hell | ya know what i mean 
14 | fall around | meant to stay | why are you doing | you’ll know what i mean 
15 | come one | in love you | you don’t think so | you’d know what i mean 
16 | out much | upon the shelf | for better or for | y’all know what i mean 
17 | last few | up and over | once upon a dream | baby know what i mean 
18 | used for | check this shit | over and forever again | like it or leave it 
19 | number on | to the brink | knock-knock-knockin’ on heaven’s door | i know what i mean 
20 | come prepared | on the dark | once upon a lifetime | ain’t no place like home 


Frequency filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | in the | i want to | la la la la | la la la la la 
2 | and i | la la la | i don’t want to | na na na na na 
3 | i don’t | i want you | na na na na | on and on and on 
4 | on the | you and me | in love with you | i want you to know 
5 | if you | i don’t want | i want you to | don’t know what to do 
6 | to me | i know you | i don’t know what | oh oh oh oh oh 
7 | to be | i need you | i don’t know why | da da da da da 
8 | i can | and i know | oh oh oh oh | do do do do do 
9 | and the | i don’t wanna | i want to be | one more chance at love 
10 | but i | i got a | know what to do | i don’t want to be 
11 | of the | i know that | what can i do | in the middle of the 
12 | i can’t | you know i | yeah yeah yeah yeah | i don’t give a fuck 
13 | for you | i can see | you don’t have to | yeah yeah yeah yeah yeah 
14 | when i | and i don’t | i close my eyes | i don’t know what to 
15 | you can | in your eyes | you want me to | all i want is you 
16 | i got | and if you | you make me feel | you know i love you 
17 | in my | the way you | i just want to | the middle of the night 
18 | all the | na na na | da da da da | the rest of my life 
19 | i want | don’t you know | if you want to | no no no no no 
20 | that i | this is the | come back to me | at the end of the 


TABLE V: With data taken from the Lyrics corpus, we present the top 20 unreferenced phrases considered for definition 
(in the live experiment) from each of the 2, 3, 4, and 5-gram likelihood filters (Above), and frequency filters (Below). 
From this corpus we note the juxtaposition of highly idiomatic expressions by the likelihood filter (like “iced up”), with the 
domination of the frequency filters by various onomatopoeiae. The phrase “iced up” is an example of the model’s success with 
this corpus, having had definition in the Urban Dictionary since 2003, indicating that one is “covered in diamonds”. Further, 
though this phrase does have a variant that is defined in the Wiktionary (as early as 2011), namely “iced out”, we note that the 
reference is also made in the Urban Dictionary (as early as 2004), where the phrase has distinguished meaning for one that is 
so bedecked, ostentatiously. 

3. English Wikipedia 



Likelihood filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | new addition | in respect to | in the other hand | the republic of the congo 
2 | african states | as united states | people’s republic of poland | so far as i know 
3 | less well | was a result | people’s republic of korea | going as far as to 
4 | south end | walk of fame | in the same time | gone so far as to 
5 | dominican order | central united states | the republic of congo | went as far as to 
6 | united front | in united states | at this same time | goes as far as to 
7 | same-sex couples | eastern united states | at that same time | the federal republic of yugoslavia 
8 | baltic states | first united states | approximately the same time | state of the nation address 
9 | to york | a united states | about the same time | as far as we know 
10 | new kingdom | under united states | around the same time | just a matter of time 
11 | east carolina | to united states | during the same time | due to the belief that 
12 | due east | of united states | roughly the same time | as far as i’m aware 
13 | united church | southern united states | ho chi minh trail | due to the fact it 
14 | quarter mile | southeastern united states | lesser general public license | due to the fact he 
15 | end date | southwestern united states | in the last minute | due to the fact the 
16 | so well | and united states | on the right hand | as a matter of course 
17 | olympic medalist | th united states | on the left hand | as a matter of policy 
18 | at york | western united states | once upon a mattress | as a matter of principle 
19 | go go | for united states | o caetano do sul | or something to that effect 
20 | teutonic order | former united states | turn of the screw | as fate would have it 


Frequency filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | of the | one of the | in the united states | years of age or older 
2 | in the | part of the | at the age of | the average household size was 
3 | and the | the age of | a member of the | were married couples living together 
4 | on the | the end of | under the age of | from two or more races 
5 | at the | according to the | the end of the | at the end of the 
6 | for the | may refer to | at the end of | the median income for a 
7 | he was | member of the | as well as the | the result of the debate 
8 | it is | the university of | years of age or | of it is land and 
9 | with the | in the early | of age or older | the racial makeup of the 
10 | as a | a member of | the population density was | has a total area of 
11 | it was | in the united | the median age was | the per capita income for 
12 | from the | he was a | as of the census | and the average family size 
13 | the first | of the population | households out of which | and the median income for 
14 | as the | was born in | one of the most | the average family size was 
15 | was a | end of the | people per square mile | had a median income of 
16 | in a | in the late | at the university of | of all households were made 
17 | to be | in addition to | was one of the | at an average density of 
18 | one of | it is a | for the first time | males had a median income 
19 | during the | such as the | the result of the | housing units at an average 
20 | with a | the result was | has a population of | made up of individuals and 


TABLE VI: With data taken from the Wikipedia corpus, we present the top 20 unreferenced phrases considered for definition 
(in the live experiment) from each of the 2, 3, 4, and 5-gram likelihood filters (Above), and frequency filters (Below). From 
this corpus we note the juxtaposition of highly idiomatic expressions by the likelihood filter (like “same-sex couples”), with the 
domination of the frequency filters by highly-descriptive structural text from the presentations of demographic and numeric 
data. The phrase “same-sex couples” is an example of the model’s success with this corpus, and appears largely because of the 
existence of the distinct phrases “same-sex marriage” and “married couples” with definition in the Wiktionary. 

4. Project Gutenberg eBooks 



Likelihood filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | go if | by and bye | i ask your pardon | handsome is that handsome does 
2 | come if | purchasing power equivalent | i crave your pardon | for the grace of god 
3 | able man | of the contrary | with the other hand | be that as it might 
4 | at york | quite the contrary | upon the other hand | be that as it will 
5 | going well | of united states | about the same time | up hill and down hill 
6 | there once | so well as | and the same time | come to think about it 
7 | go well | at a rate | every now and again | is no place like home 
8 | so am | point of fact | tu ne sais pas | for the love of me 
9 | go all | as you please | quarter of an inch | so far as i’m concerned 
10 | picked out | so soon as | quarter of an ounce | you know whom i mean 
11 | very same | it a rule | quarter of an hour’s | you know who i mean 
12 | come all | so to bed | qu’il ne fallait pas | upon the face of it 
13 | look well | of a hurry | to the expense of | you understand what i mean 
14 | there all | at the rate | be the last time | you see what i mean 
15 | how am | such a hurry | and the last time | by the grace of heaven 
16 | going away | just the way | was the last time | by the grace of the 
17 | going forth | it all means | is the last time | don’t know what i mean 
18 | get much | you don’t know | so help me heaven | be this as it may 
19 | why am | greater or less | make up my mind | in a way of speaking 
20 | this same | have no means | at the heels of | or something to that effect 


Frequency filters 

rank | 2-gram | 3-gram | 4-gram | 5-gram 
1 | of the | one of the | for the first time | at the end of the 
2 | and the | it was a | at the end of | and at the same time 
3 | it was | there was a | of the united states | the other side of the 
4 | on the | out of the | the end of the | on the part of the 
5 | it is | it is a | the rest of the | distributed proofreading team at http 
6 | to be | i do not | one of the most | on the other side of 
7 | he was | it is not | on the other side | at the foot of the 
8 | at the | and it was | for a long time | percent of vote by party 
9 | for the | it would be | it seems to me | at the head of the 
10 | with the | he did not | it would have been | as a matter of course 
11 | he had | there was no | as well as the | on the morning of the 
12 | by the | and in the | i am going to | for the first time in 
13 | he said | that he was | as soon as the | it seems to me that 
14 | in a | it was not | i should like to | president of the united states 
15 | with a | it was the | as a matter of | at the bottom of the 
16 | and i | that he had | on the part of | i should like to know 
17 | that the | there is no | the middle of the | but at the same time 
18 | of his | that it was | the head of the | at the time of the 
19 | i have | he had been | at the head of | had it not been for 
20 | and he | but it was | the edge of the | at the end of a 


TABLE VII: With data taken from the eBooks corpus, we present the top 20 unreferenced phrases considered for definition 
(in the live experiment) from each of the 2, 3, 4, and 5-gram likelihood filters (Above), and frequency filters (Below). From 
this corpus we note the juxtaposition of many highly idiomatic expressions by the likelihood filter, with the domination of the 
frequency filters by highly-structural text. Here, since the texts are all within the public domain, we see that this much less 
modern corpus is without the innovation present in the others, but that the likelihood filter does still extract many unreferenced 
variants of Wiktionary-defined idiomatic forms. 















